This tutorial takes you through a simple approach to geocoding addresses using the ggmap package. The package automates the look-up of an address via the Google Geocoding Application Programming Interface (API), extracts the latitude and longitude, and stores them in a new dataframe.
To be able to use the ggmap library, you will first need to sign up for a Developer account with Google and obtain an API key (essentially a unique token that gives you access to the API). Theoretically, you will need to pay to use the Google Geocoding API - however, you will have up to $200 in free usage for Maps, Routes and Places APIs each month (and in addition, Google is offering $300 worth of ‘free credit’ for developers when you sign up for an account). This is a lot of credit on offer - around 2,000 addresses would cost approximately $10!
Once you’ve signed up to the API and got your token, it’s a simple case of ingesting your CSV with your address column and running the code - approximately 10 lines, depending on any cleaning or processing you need to complete.
This tutorial will walk you through the process. Here we’ll use a dataset taken from the data.gov.uk website that contains a list of schools in England in 2014. You can find the dataset here: https://www.gov.uk/government/publications/schools-in-england - the data itself is out-of-date, but we’re just using this for geocoding and will not use it for any analysis per se. The dataset was downloaded in a Microsoft Excel spreadsheet format - to enable easy ingestion in RStudio, the sheet required for this tutorial was opened and saved as a csv. You can find the resulting csv in the data folder of this repository.
First, let’s sign up to the Google Geocoding API and get our API key (also known as an Access Token). If you have a Gmail account, then this is super simple. Alternatively, you might want to sign up for a dummy Google account to enable you to use the API easily.
Once you have a Google account, you can follow Google’s instructions in terms of how to acquire an API key here: https://developers.google.com/maps/documentation/geocoding/get-api-key. You may be asked for a credit card to put on file for your account - as long as you stay within your credits (!), you will not be charged.
The Platform page you’ll be navigated to can be quite overwhelming at first to look at - but make sure to follow the steps! You’ll need to create a project for which you want the API key - give it a title that is specific to your project, e.g. ‘Dissertation-geocoding-schools’.
Once you’ve got a project, follow the steps outlined in the link above to create your API key - you’ll be able to find your key in the Credentials section of the platform. The link above also outlines the steps to restrict your key. As your key is linked only to your account, if you share it (accidentally or not), any usage will be applied and billed to your account. Restricting your key helps limit this risk - at a minimum, you can restrict your key so that it only works with the Geocoding API. Remember to keep your key private to prevent unauthorised use! You can also delete the key once you’ve geocoded your data (and are happy with the results).
Once you’ve got your key, you’re ready to start coding!
Open up RStudio and start a new script. The first step with any script is to load the packages you’ll need to use - here, we’ll be using ggmap for geocoding as well as the sf package to create spatial data (points) from our extracted lat-lon data, and finally the mapview package. This package will let us load our points data onto a zoomable map, which will allow us to check the success and accuracy of our geocoding. We also load the readr package to help read in the text file used to store our API key (explained below).
Make sure you have the packages installed - if you don’t, you can use the install.packages function. Here we place each package name in "" within the parentheses of the library function (the quotes are required by install.packages and optional for library).
# Load our libraries
library("readr")
library("ggmap")
## Loading required package: ggplot2
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library("sf")
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library("mapview")
Next, set your working directory.
# Set working directory
# Replace the path below with your file path
setwd("~/R-GIS-Tutorials/")
Then state your API key. Here, we have stored our API key in a text document, which is loaded into this R script and stored as the API_key variable. This means we do not hard-code our API key into a script that we will share with others, which avoids the risk of our API key being used by someone else.
You can either create a text file and store your API key for use there (recommended) or paste your API key either directly into the function or as a string stored in the API_key variable.
# Load and register API Key for ggmap library
API_key <- read_file("data/API_key.txt")
register_google(key = API_key)
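One detail worth guarding against: read_file returns the file contents verbatim, so a trailing newline left by a text editor would become part of the key string. The sketch below shows one way to strip it using base R only. The temporary file and the made-up key are stand-ins for data/API_key.txt, and this is just one possible approach rather than something ggmap requires.

```r
# Stand-in for data/API_key.txt: a temporary file holding a made-up key,
# followed by the trailing newline that most text editors add
key_file <- tempfile(fileext = ".txt")
writeLines("NOT-A-REAL-KEY", key_file)

# Read only the first line and strip any surrounding whitespace
API_key <- trimws(readLines(key_file, n = 1, warn = FALSE))

# The cleaned value is what we would pass on: register_google(key = API_key)
nchar(API_key)
```

If you prefer to keep using read_file, wrapping it as trimws(read_file("data/API_key.txt")) should have the same effect.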
Now that we’ve got the “logistics” part of our script completed, we’ll load our dataset into RStudio and prepare it for geocoding.
First, we read the csv into RStudio:
# Load our raw CSV dataset - list of schools in England
# State that the first row is a header, and to not interpret our strings as factors
schools_data <- read.csv("data/school_addresses.csv", header=TRUE, stringsAsFactors = FALSE)
# We print the first six rows of our csv to our console to check that the data has loaded correctly
head(schools_data)
## URN Local.authority..code. Local.authority..name. Establishment.number
## 1 100000 201 City of London 3614
## 2 100001 201 City of London 6005
## 3 100002 201 City of London 6006
## 4 100003 201 City of London 6007
## 5 100005 202 Camden 1048
## 6 100006 202 Camden 1100
## Establishment.name Street Locality
## 1 Sir John Cass's Foundation Primary School St James's Passage Duke's Place
## 2 City of London School for Girls St Giles' Terrace Barbican
## 3 St Paul's Cathedral School 2 New Change
## 4 City of London School Queen Victoria Street
## 5 Thomas Coram Centre 49 Mecklenburgh Square
## 6 CCfL Key Stage 4 PRU Agincourt Road
## Address3 Town County Postcode Type.of.establishment
## 1 London EC3A 5DE Voluntary Aided School
## 2 London EC2Y 8BB Other Independent School
## 3 London EC4M 9AD Other Independent School
## 4 London EC4V 3AL Other Independent School
## 5 London WC1N 2NY LA Nursery School
## 6 London NW3 2NY Pupil Referral Unit
## Statutory.highest.age Statutory.lowest.age Boarders
## 1 11 3 No Boarders
## 2 18 7 No Boarders
## 3 13 4 Boarding School
## 4 18 10 No Boarders
## 5 5 3 No Boarders
## 6 16 14 No Boarders
## Sixth.form UKPRN Phase.of.education Gender
## 1 Does not have a sixth form NA Primary Mixed
## 2 Has a sixth form 10013279 Not applicable Girls
## 3 Does not have a sixth form 10018890 Not applicable Mixed
## 4 Has a sixth form 10008165 Not applicable Boys
## 5 Not applicable NA Nursery Mixed
## 6 Not applicable 10016665 Not applicable Mixed
## Religious.character Religious.ethos Admissions.policy
## 1 Church of England Does not apply Not applicable
## 2 None Church of England Not collected
## 3 Church of England Christian Not collected
## 4 None None Not collected
## 5 Does not apply Does not apply Not applicable
## 6 Does not apply Does not apply Not applicable
## Website.address Telephone.number
## 1 www.sirjohncassprimary.org 2072831147
## 2 http://www.clsg.org.uk 2078475500
## 3 http://www.stpauls.co.uk/school/school.htm 2072485156
## 4 http://www.clsb.org/ 2074890291
## 5 http://www.thomascoram.camden.sch.uk/ 2075200385
## 6 http://ccfl.camden.sch.uk 2079748906
## Headteacher Establishment.status Reason.establishment.opened
## 1 Mr Tim Wilson Open Not applicable
## 2 Mrs Ena Harrop Open Not applicable
## 3 Mr Neil Chippington Open Not applicable
## 4 Ms Sarah Fletcher Open Not applicable
## 5 Ms Perina Holness Open Not applicable
## 6 Ms Elizabeth Rattue Open Not applicable
## Opening.date Parliamentary.Constituency..code.
## 1 E14000639
## 2 01/01/1920 E14000639
## 3 01/01/1939 E14000639
## 4 01/01/1919 E14000639
## 5 E14000750
## 6 01/09/1999 E14000750
## Parliamentary.Constituency..name. Region
## 1 Cities of London and Westminster London
## 2 Cities of London and Westminster London
## 3 Cities of London and Westminster London
## 4 Cities of London and Westminster London
## 5 Holborn and St Pancras London
## 6 Holborn and St Pancras London
We can even find out a little more information about the structure of our schools data:
# Get the structure of our data frame
str(schools_data)
## 'data.frame': 24302 obs. of 31 variables:
## $ URN : int 100000 100001 100002 100003 100005 100006 100007 100008 100009 100010 ...
## $ Local.authority..code. : int 201 201 201 201 202 202 202 202 202 202 ...
## $ Local.authority..name. : chr "City of London" "City of London" "City of London" "City of London" ...
## $ Establishment.number : int 3614 6005 6006 6007 1048 1100 1101 2019 2036 2065 ...
## $ Establishment.name : chr "Sir John Cass's Foundation Primary School" "City of London School for Girls" "St Paul's Cathedral School" "City of London School" ...
## $ Street : chr "St James's Passage" "St Giles' Terrace" "2 New Change" "Queen Victoria Street" ...
## $ Locality : chr "Duke's Place" "Barbican" "" "" ...
## $ Address3 : chr "" "" "" "" ...
## $ Town : chr "London" "London" "London" "London" ...
## $ County : chr "" "" "" "" ...
## $ Postcode : chr "EC3A 5DE" "EC2Y 8BB" "EC4M 9AD" "EC4V 3AL" ...
## $ Type.of.establishment : chr "Voluntary Aided School" "Other Independent School" "Other Independent School" "Other Independent School" ...
## $ Statutory.highest.age : int 11 18 13 18 5 16 11 11 11 11 ...
## $ Statutory.lowest.age : int 3 7 4 10 3 14 5 3 3 2 ...
## $ Boarders : chr "No Boarders" "No Boarders" "Boarding School" "No Boarders" ...
## $ Sixth.form : chr "Does not have a sixth form" "Has a sixth form" "Does not have a sixth form" "Has a sixth form" ...
## $ UKPRN : int NA 10013279 10018890 10008165 NA 10016665 NA NA NA NA ...
## $ Phase.of.education : chr "Primary" "Not applicable" "Not applicable" "Not applicable" ...
## $ Gender : chr "Mixed" "Girls" "Mixed" "Boys" ...
## $ Religious.character : chr "Church of England" "None" "Church of England" "None" ...
## $ Religious.ethos : chr "Does not apply" "Church of England" "Christian" "None" ...
## $ Admissions.policy : chr "Not applicable" "Not collected" "Not collected" "Not collected" ...
## $ Website.address : chr "www.sirjohncassprimary.org" "http://www.clsg.org.uk" "http://www.stpauls.co.uk/school/school.htm" "http://www.clsb.org/" ...
## $ Telephone.number : num 2.07e+09 2.08e+09 2.07e+09 2.07e+09 2.08e+09 ...
## $ Headteacher : chr "Mr Tim Wilson" "Mrs Ena Harrop" "Mr Neil Chippington" "Ms Sarah Fletcher" ...
## $ Establishment.status : chr "Open" "Open" "Open" "Open" ...
## $ Reason.establishment.opened : chr "Not applicable" "Not applicable" "Not applicable" "Not applicable" ...
## $ Opening.date : chr "" "01/01/1920" "01/01/1939" "01/01/1919" ...
## $ Parliamentary.Constituency..code.: chr "E14000639" "E14000639" "E14000639" "E14000639" ...
## $ Parliamentary.Constituency..name.: chr "Cities of London and Westminster" "Cities of London and Westminster" "Cities of London and Westminster" "Cities of London and Westminster" ...
## $ Region : chr "London" "London" "London" "London" ...
From this, we’ve got a list of the columns (and examples of their content) and we can see that we have 24,302 schools. This would be quite a lot to geocode - if each school takes 3 seconds to geocode, this could take us up to 20 hours!
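The 20-hour figure is simple arithmetic, which can be checked in the console. The 3 seconds per request is an assumed average for illustration, not a measured value.

```r
# Number of schools reported by str() above
n_schools <- 24302

# Assumed average time per geocoding request, in seconds
seconds_per_request <- 3

# Total time in hours (3,600 seconds per hour)
total_hours <- n_schools * seconds_per_request / 3600
round(total_hours, 1)
## [1] 20.3
```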
As this is a tutorial, we’ll pretend that for our theoretical analysis, we want to focus on the schools within London - hopefully this will mean that we won’t have quite as many schools to process!
To obtain only the schools for London, we’ll create a subset of our schools data frame - we’ll also check the length of our data frame to see how many schools we will end up geocoding.
# Subset to only schools in 'London', i.e. where Town is equal to London
london_schools_data <- subset(schools_data, Town=='London')
# Get the number of observations (we run this on a column to get the number of observations)
length(london_schools_data$URN)
## [1] 1915
Great, we’re under 2,000 observations - which will be much quicker to process and keep us within our usage limits!
Now we want to get our data ready for geocoding. The way in which the ggmap package works is very simple - it is an automated way to access the Geocoding API we now have access to via the API key.
Essentially, the package takes each address you provide, sends it to the Geocoding API as a request (much as if you had typed it into a Google Maps search) and reads the result that comes back. If you ran this search manually, i.e. directly on the Google Maps website, you would see a pop-up with lots of information about the address you’ve provided - including the latitude and longitude. The ggmap package instead receives this information as a JSON response, extracts the values for the latitude and longitude of your address and stores them in a dataframe for you.
Whilst you could do this yourself by manually populating a csv as you searched for each address, the ggmap package will be substantially faster and less subject to copy and pasting errors! It is however not without fault, and may end up geocoding the wrong location - as a result, you should always double-check your data after geocoding, which we’ll do by mapping our points on a map.
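To make the extraction step a little more concrete, here is a rough sketch of pulling a latitude and longitude out of a parsed response. The nested list is a hand-written stand-in, trimmed to the fields of interest, for what a Geocoding API JSON response looks like once parsed into R. It is not ggmap’s actual internal code, and the coordinates are those of the first school in our dataset.

```r
# Hand-written stand-in for a parsed Geocoding API response
# (a real response carries many more fields than shown here)
response <- list(
  status = "OK",
  results = list(
    list(
      geometry = list(
        location = list(lat = 51.51348, lng = -0.077544)
      )
    )
  )
)

# Take the first (best) match and pull out its coordinates
location <- response$results[[1]]$geometry$location

# Store them in a one-row dataframe, longitude first
coords <- data.frame(lon = location$lng, lat = location$lat)
coords
```

Note that the API names the longitude field lng, whereas the dataframe column we end up with is called lon.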
To improve the accuracy of the geocoding, we’ll provide ggmap with as much information as possible. As you can see from the data frame structure above, many of our address components (e.g. school name, street address, postcode) are currently separated out. We will therefore create a new column within our dataframe that constructs a complete address with these components. This column, gg_address, will then be used for geocoding.
# Create new column that joins the establishment name, street and postcode together
london_schools_data$gg_address <- paste(london_schools_data$Establishment.name, london_schools_data$Street, london_schools_data$Postcode, sep= ", ")
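If you wanted to add further components such as Locality, some of which are empty for many schools, a plain paste would leave doubled separators (", ,") in the address. The sketch below shows one way around this on a two-row toy dataframe. The join_address helper is hypothetical (it is not part of ggmap or base R) and the toy rows simply mirror the first schools shown above.

```r
# Toy rows mirroring two of the schools above; Locality is empty for the second
toy <- data.frame(
  Establishment.name = c("Sir John Cass's Foundation Primary School",
                         "St Paul's Cathedral School"),
  Street = c("St James's Passage", "2 New Change"),
  Locality = c("Duke's Place", ""),
  Postcode = c("EC3A 5DE", "EC4M 9AD"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join the parts of one address, skipping blanks and NAs
join_address <- function(...) {
  parts <- trimws(c(...))
  paste(parts[!is.na(parts) & parts != ""], collapse = ", ")
}

# Apply the helper row by row across the chosen columns
toy$gg_address <- mapply(join_address,
                         toy$Establishment.name, toy$Street,
                         toy$Locality, toy$Postcode,
                         USE.NAMES = FALSE)
toy$gg_address
```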
Now that we have our column for geocoding, we can run our geocoding process!
Using the ggmap package, we can either choose to geocode each address one by one using the geocode function, or use the mutate_geocode function to batch process our data. The latter will geocode every address provided and then store the results in a dataframe. It means you do not need to write out a complicated for loop to use the geocode function.
We’ll have a quick look at the geocode function to see what the output would be for one address - and check that our API key set-up is working. You can also navigate to the Metrics page of the Geocoding API within your Google Cloud Platform to see the request register!
# Geocode the first line in our london_schools_data set:
geocode(london_schools_data$gg_address[1])
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Sir+John+Cass's+Foundation+Primary+School,+St+James's+Passage,+EC3A+5DE&key=xxx
## # A tibble: 1 x 2
## lon lat
## <dbl> <dbl>
## 1 -0.0775 51.5
Great, we can see that a longitude and latitude are provided by our code. To check this quickly, you can open https://www.google.com/maps and enter the latitude and longitude into the search box (note the order is reversed, i.e. latitude first). We can check the location against the first entry of our schools dataset from earlier - and it looks like we’ve managed to geocode Sir John Cass’s Foundation Primary School to its location! Great! Now we know our code works and it’s likely to (fingers crossed!) geocode to the right location, we’ll run this on our overall dataset to produce a new dataframe called gg_address_geoc.
We’ll also go make a cup of tea as this might take some time! (Approximately 15 minutes per 2,000 requests!)
# Geocode our dataset, using the gg_address column and the mutate_geocode function
gg_address_geoc <- mutate_geocode(london_schools_data, gg_address)
print("Geocoding complete!")
## [1] "Geocoding complete!"
We can check our final output by using the head command again:
#Check the first couple of lines of our dataset
head(gg_address_geoc)
## X.2 X.1 X URN Local.authority..code. Local.authority..name.
## 1 1 1 1 100000 201 City of London
## 2 2 2 2 100001 201 City of London
## 3 3 3 3 100002 201 City of London
## 4 4 4 4 100003 201 City of London
## 5 5 5 5 100005 202 Camden
## 6 6 6 6 100006 202 Camden
## Establishment.number Establishment.name
## 1 3614 Sir John Cass's Foundation Primary School
## 2 6005 City of London School for Girls
## 3 6006 St Paul's Cathedral School
## 4 6007 City of London School
## 5 1048 Thomas Coram Centre
## 6 1100 CCfL Key Stage 4 PRU
## Street Locality Address3 Town County Postcode
## 1 St James's Passage Duke's Place London EC3A 5DE
## 2 St Giles' Terrace Barbican London EC2Y 8BB
## 3 2 New Change London EC4M 9AD
## 4 Queen Victoria Street London EC4V 3AL
## 5 49 Mecklenburgh Square London WC1N 2NY
## 6 Agincourt Road London NW3 2NY
## Type.of.establishment Statutory.highest.age Statutory.lowest.age
## 1 Voluntary Aided School 11 3
## 2 Other Independent School 18 7
## 3 Other Independent School 13 4
## 4 Other Independent School 18 10
## 5 LA Nursery School 5 3
## 6 Pupil Referral Unit 16 14
## Boarders Sixth.form UKPRN Phase.of.education Gender
## 1 No Boarders Does not have a sixth form NA Primary Mixed
## 2 No Boarders Has a sixth form 10013279 Not applicable Girls
## 3 Boarding School Does not have a sixth form 10018890 Not applicable Mixed
## 4 No Boarders Has a sixth form 10008165 Not applicable Boys
## 5 No Boarders Not applicable NA Nursery Mixed
## 6 No Boarders Not applicable 10016665 Not applicable Mixed
## Religious.character Religious.ethos Admissions.policy
## 1 Church of England Does not apply Not applicable
## 2 None Church of England Not collected
## 3 Church of England Christian Not collected
## 4 None None Not collected
## 5 Does not apply Does not apply Not applicable
## 6 Does not apply Does not apply Not applicable
## Website.address Telephone.number
## 1 www.sirjohncassprimary.org 2072831147
## 2 http://www.clsg.org.uk 2078475500
## 3 http://www.stpauls.co.uk/school/school.htm 2072485156
## 4 http://www.clsb.org/ 2074890291
## 5 http://www.thomascoram.camden.sch.uk/ 2075200385
## 6 http://ccfl.camden.sch.uk 2079748906
## Headteacher Establishment.status Reason.establishment.opened
## 1 Mr Tim Wilson Open Not applicable
## 2 Mrs Ena Harrop Open Not applicable
## 3 Mr Neil Chippington Open Not applicable
## 4 Ms Sarah Fletcher Open Not applicable
## 5 Ms Perina Holness Open Not applicable
## 6 Ms Elizabeth Rattue Open Not applicable
## Opening.date Parliamentary.Constituency..code.
## 1 E14000639
## 2 01/01/1920 E14000639
## 3 01/01/1939 E14000639
## 4 01/01/1919 E14000639
## 5 E14000750
## 6 01/09/1999 E14000750
## Parliamentary.Constituency..name. Region
## 1 Cities of London and Westminster London
## 2 Cities of London and Westminster London
## 3 Cities of London and Westminster London
## 4 Cities of London and Westminster London
## 5 Holborn and St Pancras London
## 6 Holborn and St Pancras London
## gg_address
## 1 Sir John Cass's Foundation Primary School, St James's Passage, EC3A 5DE
## 2 City of London School for Girls, St Giles' Terrace, EC2Y 8BB
## 3 St Paul's Cathedral School, 2 New Change, EC4M 9AD
## 4 City of London School, Queen Victoria Street, EC4V 3AL
## 5 Thomas Coram Centre, 49 Mecklenburgh Square, WC1N 2NY
## 6 CCfL Key Stage 4 PRU, Agincourt Road, NW3 2NY
## lon lat
## 1 -0.0775440 51.51348
## 2 -0.0943486 51.51917
## 3 -0.0968205 51.51386
## 4 -0.0986223 51.51182
## 5 -0.1206182 51.52543
## 6 -0.1601399 51.55386
It looks like we’ve got the structure we expected - but the next question is, did the geocoding work on all of our addresses? To find out, we’ll query whether any observations have NA values in their latitude column. We can use the which and is.na functions to tell us which observations have an NA value:
# Identify rows/observations with NA values
which(is.na(gg_address_geoc$lat))
## [1] 126 648 748 848 1385 1431 1684 1911
So we have 8 entries that were not geocoded - not bad considering we have 1915 schools! With this small number, it’s up to you as the analyst to determine how you would try to fill in these data gaps. One approach is to manually geocode them yourself. To do this, we can export gg_address_geoc to a csv, which we can then edit ourselves with the correct longitudes and latitudes as you find them. You can also do this directly in RStudio, using selection and replacement - but we’ll get onto that another time. For now, we can export the dataframe to a csv for use within manual cleaning:
# Export gg_address_geoc dataframe to csv within the data folder
# We will set the row.names argument to TRUE so it is easy to identify the 8 observations with NA values (although this would be easy to search in a spreadsheet editor, such as Excel!)
write.csv(gg_address_geoc, "data/london_schools_geocoded.csv", row.names=TRUE)
To keep this tutorial relatively short, we won’t go into the details of editing the exported spreadsheet. In this case, with this few missing entries, it would not take long to do a manual search of Google Maps to try to locate the missing schools.
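If you would rather fill the gaps directly in R instead of a spreadsheet, selection and replacement could look roughly like the sketch below. It runs on a three-row toy dataframe rather than gg_address_geoc, and both the URN values and the replacement coordinates are made-up placeholders, not real look-ups.

```r
# Toy stand-in for gg_address_geoc: the second school failed to geocode
geoc <- data.frame(
  URN = c(900001, 900002, 900003),   # made-up identifiers
  lon = c(-0.0775, NA, -0.0968),
  lat = c(51.513, NA, 51.514)
)

# Which rows still need coordinates?
missing_rows <- which(is.na(geoc$lat))
missing_rows

# Select the school by its identifier and replace the missing values
# (placeholder coordinates, for illustration only)
geoc$lon[geoc$URN == 900002] <- -0.1
geoc$lat[geoc$URN == 900002] <- 51.5

# Confirm that nothing is left to fill in
anyNA(geoc[, c("lon", "lat")])
```

Selecting by an identifier such as URN, rather than by row number, keeps the replacement pointing at the right school even if the dataframe is later re-sorted or subset.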
In addition to the missing values, it would also be a good idea to check the accuracy of the geocoding. To do this, we’ll map the schools and check that they are all located in London. We’ll use the mapview package to display our schools as it provides an interactive map to navigate and check the distribution and metadata of our schools.
To map our data, we’ll need to turn our data frame into a spatial dataset. Here we will use the sf package, although the sp package also works with mapview. To be able to map all of our points, for now, we’ll also remove the eight that were not geocoded by creating a subset of our dataset. The code will not work if there are missing values in these columns!
# Create a subset of our geocoded schools data frame to remove those schools without lat/lon data
london_schools_gc <- subset(gg_address_geoc, !is.na(lat))
# Create an sf points spatial object, using the lon and lat columns and stating WGS84 as the crs
school.points <- st_as_sf(london_schools_gc, coords = c("lon", "lat"), crs = 4326)
# Launch the mapview plot to check the spatial accuracy of our geocode:
mapview(school.points)
Oh, yikes! It looks like we’ve ended up with a few schools not exactly where we would want them! But the great thing about the mapview plot is that we can click on each of these schools to find out what might have gone wrong - and make a note of them to clean manually in our spreadsheet. It looks like in total we have 6 schools in the USA and one north of Cambridge to relocate.
In total that makes 7 schools to relocate and 8 to add coordinates to - considering we have 1915 observations, it’s a much smaller amount to clean/geocode manually than when we started this tutorial! Of course, with these issues, we will need to consider the accuracy of the other 1900 schools, but zooming in on the dataset, it seems to roughly follow the expected shape of London. There are other approaches we could use to check the validity of this final dataset, but we’ll save this for another time.
The next steps from here therefore would be to manually clean and geocode the 15 entries within the exported CSV - and then start a new script where you load this final csv and convert it to a new spatial points object for your analysis!
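As a taste of one such validity check, a simple bounding-box test can flag points that fall well outside London without needing to spot them by eye on the map. The sketch below uses a toy dataframe with invented points, and the box limits are rough, generous approximations of Greater London’s extent chosen for illustration - you would want to tune them for a real analysis.

```r
# Toy points: one plausible London location and two clear outliers
# (coordinates are invented for illustration)
toy_points <- data.frame(
  name = c("central London", "north-eastern USA", "north of Cambridge"),
  lon = c(-0.0775, -71.06, 0.12),
  lat = c(51.513, 42.36, 52.35)
)

# Rough bounding box around Greater London (approximate values)
in_box <- toy_points$lon > -0.6 & toy_points$lon < 0.4 &
  toy_points$lat > 51.2 & toy_points$lat < 51.8

# List the points that fall outside the box and need a manual check
toy_points[!in_box, ]
```

A box like this only catches gross errors, such as a school placed in another country; a point geocoded to the wrong street within London would still pass.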